Input a data set D consisting of text documents, each labeled as belonging to 0 or more classes from a set or hierarchy of classes S.

120 Construct a single vector representation of text features extracted from or associated with each text document in D.

130 For each labeled text document in data set D, create a training set T(D) by labeling the vector with the same set of classes used to label the text document.

140 Use the training set T(D) to induce classification methods that can be used to assign classes in S to a hitherto unseen feature vector with the same structure as those in T(D).

150 For each class in S, output the classification methods that can be used to assign that class to a hitherto unseen text document by applying the methods to a feature vector derived from the text document in the same way that the feature vectors in T(D) were derived from the data set D.

Prior Art

100



Figure 1 



Media item 200




Figure 2 



305 Visual track
320 Select key frames / key intervals
325 Transform visual data into visual feature spaces
330 Quantize visual feature values into discrete values
335 Compute a visual feature vector

300

310 Audio track
350 Transcribe speech track
355 Optionally include: closed- & open-captioning
360 Tokenization, and optionally stemming
365 Compute a textual feature vector

370 Compute visual / textual / (audio) feature vector

380



Figure 3 
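
The audio-track steps of Figure 3 (tokenization, then computing a textual feature vector) can be illustrated with a minimal sketch. It assumes a simple word-count representation over a fixed vocabulary; the vocabulary, the sample transcript, and the function names are hypothetical, and stemming (marked optional in the figure) is left out.

```python
import re
from collections import Counter

def tokenize(transcript):
    """Lowercase the transcript and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", transcript.lower())

def textual_feature_vector(transcript, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(tokenize(transcript))
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary and transcript, chosen only for illustration.
vocabulary = ["golf", "club", "smoke", "weather"]
transcript = "The golf club hosted a golf game."
print(textual_feature_vector(transcript, vocabulary))  # [2, 1, 0, 0]
```

Closed- or open-captioning text could be appended to the transcript before tokenization, matching the optional step in the figure.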



410 Input a data set D consisting of media items, each labeled as belonging to 0 or more classes from a set or hierarchy of classes S.

420 Construct a single vector representation of text features and/or audio features extracted from or associated with each media item in D.

430 Construct a single vector representation of visual features extracted from or associated with each media item in D.

440 For each labeled media item in data set D, create a training set T(D) by combining the 2 vector representations of that media item (constructed in the preceding 2 steps) into a single composite feature vector, with the resulting vector labeled by the same set of classes used to label the media item.

450 Use a supervised learning technique, with T(D) as training set, to induce a classifier that can be used to assign classes in S to a hitherto unseen composite feature vector with the same structure as those in T(D).

460 Output the classifier induced in the preceding step.

400



Figure 4 
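
The flow of Figure 4 can be sketched end to end under strong simplifying assumptions. The figure does not name a supervised learning technique, so a nearest-centroid classifier stands in for it here; each item carries exactly one class label (the figure allows 0 or more); and all vectors and labels below are made-up illustration data.

```python
import math
from collections import defaultdict

def composite(text_vector, visual_vector):
    """Combine the two per-item vectors into one composite feature vector."""
    return list(text_vector) + list(visual_vector)

def induce_classifier(training_set):
    """Induce a nearest-centroid classifier (a stand-in technique).

    training_set is a list of (composite_vector, class_label) pairs.
    """
    sums = {}
    counts = defaultdict(int)
    for vector, label in training_set:
        if label not in sums:
            sums[label] = [0.0] * len(vector)
        sums[label] = [s + v for s, v in zip(sums[label], vector)]
        counts[label] += 1
    centroids = {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }

    def classify(vector):
        # Assign the class whose centroid is closest in Euclidean distance.
        return min(centroids, key=lambda label: math.dist(centroids[label], vector))

    return classify

# Hypothetical training set T(D): (text vector, visual vector) plus a label.
training_set = [
    (composite([2, 0], [8, 1]), "sports"),
    (composite([1, 0], [6, 2]), "sports"),
    (composite([0, 3], [1, 7]), "weather"),
    (composite([0, 2], [2, 9]), "weather"),
]
classify = induce_classifier(training_set)
print(classify(composite([1, 0], [7, 1])))  # sports
```

An unseen item is classified only after it has been turned into a composite vector with the same structure as the training vectors, which is the condition the figure states.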



510 Visual:
512 Speech transcript: "golf"
514 Speech transcript: grass
520
525 Category: Sports
530 Subject: Golf
540
550
545 Category: Sports
548 Subject: Golf game
Category: Sports
Subject: Interview about golf
555
558



Figure 5 



(Labeled) reference media items D, 1, ..., N
610
Training / learning 620
Class (category) representations 675
Training / learning phase 610
A

Class (category) representations 675
Target media item M 660
Classification (categorization) engine 650
685
Classification phase 615
B



Figure 6 



770
delta
Key frame or key interval selection
Media item
Time: 0 secs to T secs
750
755
Quantize key interval (domain) (regions) 706
Transformation 707
Compute region feature 708
Visual feature values 775



Figure 7A 



770
Transformation 717
Quantize domain (regions)
Compute region feature 708
Visual feature values 775



Figure 7B 



(Part of) 
media item 
810 




Figure 8 



Media item 
(frame) 
900 



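[Frame divided into a grid of regions numbered 1, 2, 3, ..., k, ..., 16. Reference numerals: 901 (region 1), 902 (region 2), 903 (region k), 904 (region 16).]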



Frame region (k) 910 



Figure 9 



Video frame 950 




960 



Figure 9A 



Video stream 1010 




1008 



1020 



Video transformation




Transformed media 
1030 



Figure 10 



t0 760
Video stream 750
Frame 1110, YIQ (frame)
1175
YIQ -> HSV 1115
Hue (frame) 1120
Rectangular windows 1130
Hue (window) 1135
Averaging 1140
Average_hue (window) 1145
Code assignment 1150
Feature vector 1155



Figure 11A 



Hue circle: 0° = 360° (1160, 1165); 60° (1162); 240°; 270°; 300°; 330° (1164)

Hue quantization (1180)

Color range        Code   Color
330-30 degrees     0      Red (1170)
30-90 degrees      1      Yellow (1171)
90-150 degrees     2      Green (1172)
150-210 degrees    3      Cyan (1173)
210-270 degrees    4      Blue (1174)
270-330 degrees    5      Magenta (1175)



Figure 11B 
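
The hue quantization table of Figure 11B maps a hue angle to one of six codes. The sketch below is one way to express that mapping; the function name is hypothetical, and the treatment of boundary angles (for example, exactly 30 degrees is assigned to code 1) is an assumption, since the table does not say which range owns its end points.

```python
COLOR_NAMES = ["Red", "Yellow", "Green", "Cyan", "Blue", "Magenta"]

def hue_code(hue_degrees):
    """Map a hue angle in degrees to a code 0..5 following the table.

    Shifting by 30 degrees makes the red range (330 to 30 degrees),
    which wraps around 0, start at 0, so each code covers 60 degrees.
    Assumption: the lower bound of each range is inclusive.
    """
    shifted = (hue_degrees % 360.0 + 30.0) % 360.0
    return int(shifted // 60.0)

for hue in (0, 45, 120, 180, 240, 300, 345):
    code = hue_code(hue)
    print(hue, code, COLOR_NAMES[code])
```

Angles outside 0 to 360 degrees are wrapped with the modulo operation before the code is assigned.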



Visual part of media item 1210
Spatial and temporal quantization 1220
Feature computation 1230
Feature quantization / code assignment 1240
Mapping / counting 1250
Visual feature vector F_v 1260



Figure 12 




1303
1306
1309
1310

1 1 1 1
1 5 5 1
1 5 5 1
2 2 2 2

1335
8 4 0 0 4 0

B
1330




Figure 13 
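
The numbers in Figure 13 are consistent with a counting step: a 4-by-4 grid of region codes (as read from the scan: three rows containing codes 1 and 5, and one row of code 2) and a six-entry vector 8 4 0 0 4 0 that holds the number of regions per code. The sketch below reproduces that count; the function name and the assumption that codes are numbered from 1 to 6 are mine, inferred from the vector length.

```python
def code_histogram(code_grid, num_codes):
    """Count how many regions carry each code.

    Assumption: codes are numbered 1..num_codes, so code c is
    counted at index c - 1 of the returned vector.
    """
    counts = [0] * num_codes
    for row in code_grid:
        for code in row:
            counts[code - 1] += 1
    return counts

# Code grid as read from the figure.
grid = [
    [1, 1, 1, 1],
    [1, 5, 5, 1],
    [1, 5, 5, 1],
    [2, 2, 2, 2],
]
print(code_histogram(grid, 6))  # [8, 4, 0, 0, 4, 0]
```

The result discards where in the frame each code occurred and keeps only how often it occurred.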




Figure 13D 



Media item 1310
Select key frames / intervals 1401
k_i, i = 1, ..., n   1402
Order key frames / intervals 1403
k_j, j = 1, ..., n   1320

1353
1356
1359
1360
1320

Start 1410
i, j, k = 1   1412
F(k) = C   1414
1418 no
i = i + 1; k = k + 1   1416
1424 no
End 1428



Figure 14A 



Shorten 1430
F_v 1432

Flowchart 1440 (recoverable steps):
Start 1441
Input: W, N, M, m, m_c, F_v; i = w = s = 1   1442
F_v(s) = F_v(w)   1444
w = w + m_c
s > M   1460
Output: F_v   1462
Stop 1466
Unassigned reference numerals: 1446, 1448, 1452, 1456, 1458

Flowchart 1470 (recoverable steps):
Start 1471
Input: W, N, M, m, m_f, F_v; i = w = s = 1   1472
F_v(s) = avg(F_v(w) + F_v(w + W) + ... + F_v(w + (m_f - 1) W))
w = w + m_f   1476
Output: F_v   1494
Stop 1496
Unassigned reference numerals: 1474, 1478, 1480, 1482, 1486, 1488, 1490, 1492

Steps not attributable to one flowchart: s = s + 1; i = i + 1; w = i m + 1; yes / no branches



Figure 14B 
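
Figure 14B appears to show two ways of shortening the visual feature vector F_v: copying selected entries (F_v(s) = F_v(w), with w advanced by m_c) and averaging entries that lie W positions apart (F_v(w), F_v(w + W), ..., F_v(w + (m_f - 1) W)). The sketch below is one reading of those two ideas, not a transcription of the flowcharts, whose loop bounds are not recoverable. It assumes F_v is stored frame by frame with W entries per key frame; the function names are hypothetical.

```python
def shorten_by_subsampling(f_v, W, m_c):
    """Keep the W entries of every m_c-th key frame (assumed reading)."""
    n_frames = len(f_v) // W
    shortened = []
    for frame in range(0, n_frames, m_c):
        shortened.extend(f_v[frame * W:(frame + 1) * W])
    return shortened

def shorten_by_averaging(f_v, W, m_f):
    """Average each window position over groups of m_f consecutive frames.

    Entries W apart belong to the same window position in successive
    frames, which matches the stride in the figure's averaging formula.
    Trailing frames that do not fill a whole group are dropped.
    """
    n_frames = len(f_v) // W
    shortened = []
    for first in range(0, n_frames - m_f + 1, m_f):
        for w in range(W):
            values = [f_v[(first + k) * W + w] for k in range(m_f)]
            shortened.append(sum(values) / m_f)
    return shortened

# Four key frames with W = 2 windows each (made-up values).
f_v = [1, 2, 3, 4, 5, 6, 7, 8]
print(shorten_by_subsampling(f_v, W=2, m_c=2))  # [1, 2, 5, 6]
print(shorten_by_averaging(f_v, W=2, m_f=2))    # [2.0, 3.0, 6.0, 7.0]
```

Subsampling keeps original feature values, while averaging smooths them; which one the flowcharts prefer in a given setting is not stated on the drawing sheet.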



F_v 1510
F_t 1520
Optional transformation 1530
Optional transformation 1540
Vector combination / concatenation 1550
F 1560



Figure 15 
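
The combination step of Figure 15 can be sketched as follows. The figure does not say what the optional transformations are, so L2 normalization is used here purely as a hypothetical example; the function names are also assumptions.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are unchanged)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else list(vector)

def combine(f_v, f_t, transform_v=None, transform_t=None):
    """Apply the optional transformations, then concatenate F_v and F_t."""
    if transform_v is not None:
        f_v = transform_v(f_v)
    if transform_t is not None:
        f_t = transform_t(f_t)
    return list(f_v) + list(f_t)

f_v = [3, 4]      # made-up visual feature vector
f_t = [2, 0, 0]   # made-up textual feature vector
print(combine(f_v, f_t))                              # [3, 4, 2, 0, 0]
print(combine(f_v, f_t, l2_normalize, l2_normalize))  # [0.6, 0.8, 1.0, 0.0, 0.0]
```

Normalizing each part before concatenation keeps one modality from dominating the combined vector only because its raw counts are larger.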



1631
1603
Category
NTSC
Frame Grabber 1607
MPEG1 / MPEG2 file or stream 1609
MPEG Decoder 1611
AVI file 1613
AVI Decoder 1617
XXX file 1619
XXX Decoder 1621
1623 RealVideo stream or file
Real video Decoder 1627
RGB / YIQ Frames 1629
Categorization engine 1633
Category 1637



Figure 16 



[Timeline diagram. Axes: Time, t (1715); Frame number, n (1725); Word count, w (1797). Tracks: Textual information (1710); Visual information (1720). Labeled quantities: M(t) (1750); T (1718); N (1728); T_0 (1785); N_0 (1795); n_0; F (1755); F(n), F(t) (1775); C(n), C(t) (1780); C(w) (1785). Panel label: B. Other reference numerals: 1700, 1730, 1735, 1740, 1745, 1754, 1758, 1761 to 1768, 1780, 1790, 1799.]



Figure 17 



Streaming data storage 1810
M(t), 1750
Block process 1820
B(t), 1830
Vector extraction and combining process 1840
F(t), 1850
Application process 1860
C(t), 1870
Aggregation process (optional) 1880
C(t), 1890



Figure 18 



[Segment diagram. Topic segments along the t, n, w axis (1906), with total extent T, N, W (1903): US President (1910); European Union (1915); weather (1920); Free trade (1925); Crime in the cities (1930); weather (1935); Baseball league (1940). Labeled quantities: B(t) (1962); F(t) (1963); C(t) (1964). Panel label: B. Other reference numerals: 1900, 1905, 1950, 1970, 1985, 1990, 1991, 1994, 1996.]



Figure 19 



US President = c_1 (1910); European Union = c_2 (1915); weather = c_3 (1920)
1905 C_t(t)
1955 C(t): [noisy per-block class sequence; the individual subscripts are not recoverable from the scan]
A

2050 |S| > s seconds

c_x c_y c_x -> c_x c_x c_x
c_x c_x c_y c_y c_x -> c_x c_x c_x c_x c_x
c_x c_y c_y c_x c_x -> c_x c_x c_x c_x c_x
c_x c_x c_y c_y c_y c_x c_x -> c_x c_x c_x c_x c_x c_x c_x

t, n, w 1906

2055
2060
2065
2070
2075

US President (1910); European Union (1915); weather (1920)
[class sequences along the t, n, w axis (1906); the individual subscripts are not recoverable from the scan]
2000

audio: silence 2005
audio: speaker change 2010
speech transcript: end-of-sentence 2015
visual information: shot break 2020
closed-captioning: ">>" 2025

B



Figure 20 
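
The recoverable rewrite rules in Figure 20 all replace a short run of class c_y that is flanked on both sides by class c_x with c_x. The sketch below generalizes that pattern: any run of at most max_run labels whose neighbors agree is relabeled. This is an assumption-laden simplification; the figure's rules may impose conditions on the length of the flanking runs or on duration (the "|S| > s seconds" note) that are not modeled here, and the function name and max_run parameter are hypothetical.

```python
from itertools import groupby

def smooth_labels(labels, max_run=3):
    """Relabel short runs that are flanked on both sides by the same class.

    Runs are processed left to right, and the left neighbor is taken
    after smoothing, so an alternating sequence collapses to one class.
    """
    runs = [(label, len(list(group))) for label, group in groupby(labels)]
    smoothed = []
    for index, (label, length) in enumerate(runs):
        if smoothed and index + 1 < len(runs):
            left, right = smoothed[-1], runs[index + 1][0]
            if left == right and length <= max_run:
                label = left
        smoothed.extend([label] * length)
    return smoothed

print("".join(smooth_labels("xyx")))      # xxx
print("".join(smooth_labels("xxyyx")))    # xxxxx
print("".join(smooth_labels("xxyyyxx")))  # xxxxxxx
```

Runs longer than max_run are left alone, so a genuine change of topic that lasts several blocks is not smoothed away.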



2105 M(t)
2110 F_i = F(t_i)
2100
Application process 1860
2115 C_i = C(t_i) = S(...)   [argument of S illegible in the scan]
Aggregation process 1880: $\min_{(C_1, C_2, \ldots, C_T)} \sum_{i=1} L(C_i, C_{i+1})$
2120 C_i = C(t_i) = S(...)   [argument of S illegible in the scan]
A

Training data 2130
Heuristic rules 2135
Multi modal model building 2150
2140 Multi modality costs: L(C_1, C_1), L(C_1, C_2), L(C_2, C_2), L(C_2, C_1)
B



Figure 21 



F_s -> "sport"
F_* -> "sport or disaster"
F_d -> "disaster"

L(sport, sport) = small
L(sport, disaster) = large
L(disaster, sport) = large
L(disaster, disaster) = small

2205
2210

2215
2220 "golf game" F 2225
2230 grass F 2235
2240 green; Speech: "golf club" F 2245
Interpretation: sport -> sport -> sport -> sport 2250

2260
2255 Speech: "golf" F_s 2265
Interpretation: sport -> disaster
2270 Speech: "smoke" F_d 2275
disaster 2290
2280 Speech: "wet" F 2285



Figure 22 
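
Figure 22 illustrates choosing one class per block so that the summed cost L between consecutive classes is as small as possible, with some feature vectors allowing more than one class ("sport or disaster"). A minimal dynamic-programming sketch of that selection follows. The numeric costs standing in for "small" and "large", the function name, and the choice of which blocks are ambiguous in the two examples are assumptions, since the subscripts on several F labels are not legible.

```python
def best_interpretation(candidates, cost):
    """Pick one class per block, minimizing the summed transition cost.

    candidates: list of lists, the classes allowed for each block.
    cost: dict mapping (previous_class, next_class) to a number.
    """
    # best[c] holds (total cost, path) of the cheapest sequence ending in c.
    best = {c: (0, [c]) for c in candidates[0]}
    for allowed in candidates[1:]:
        step = {}
        for c in allowed:
            total, path = min(
                ((prev_total + cost[(p, c)], prev_path)
                 for p, (prev_total, prev_path) in best.items()),
                key=lambda item: item[0],
            )
            step[c] = (total, path + [c])
        best = step
    return min(best.values(), key=lambda item: item[0])[1]

SMALL, LARGE = 1, 10  # hypothetical values for "small" and "large"
cost = {
    ("sport", "sport"): SMALL,
    ("sport", "disaster"): LARGE,
    ("disaster", "sport"): LARGE,
    ("disaster", "disaster"): SMALL,
}

# Assumed reading: the two middle blocks are ambiguous.
golf = [["sport"], ["sport", "disaster"], ["sport", "disaster"], ["sport"]]
print(best_interpretation(golf, cost))   # ['sport', 'sport', 'sport', 'sport']

# Assumed reading: "golf" is sport, "smoke" is disaster, "wet" is ambiguous.
smoke = [["sport"], ["disaster"], ["sport", "disaster"]]
print(best_interpretation(smoke, cost))  # ['sport', 'disaster', 'disaster']
```

In both examples the ambiguous blocks take the class of their unambiguous neighbors, because staying in the same class is cheaper than switching.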



